HR ANALYTICS by Coslovich Simone

Introduction

In this project I explore the “Human Resources Analytics” dataset, found on Kaggle (https://www.kaggle.com/ludobenistant/hr-analytics). My purpose is not to predict which employee will leave next, but only to understand the main reasons why employees leave.

First, I started with a preliminary exploration of the dataset and its variables.

Univariate Plots Section

This dataset is composed of 10 variables, which I explore in the following lines.

## 'data.frame':    14999 obs. of  10 variables:
##  $ satisfaction_level   : num  0.38 0.8 0.11 0.72 0.37 0.41 0.1 0.92 0.89 0.42 ...
##  $ last_evaluation      : num  0.53 0.86 0.88 0.87 0.52 0.5 0.77 0.85 1 0.53 ...
##  $ number_project       : int  2 5 7 5 2 2 6 5 5 2 ...
##  $ average_montly_hours : int  157 262 272 223 159 153 247 259 224 142 ...
##  $ time_spend_company   : int  3 6 4 5 3 3 4 5 5 3 ...
##  $ Work_accident        : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ left                 : int  1 1 1 1 1 1 1 1 1 1 ...
##  $ promotion_last_5years: int  0 0 0 0 0 0 0 0 0 0 ...
##  $ sales                : Factor w/ 10 levels "accounting","hr",..: 8 8 8 8 8 8 8 8 8 8 ...
##  $ salary               : Factor w/ 3 levels "high","low","medium": 2 3 3 2 2 2 2 2 2 2 ...



From this view of the data, we can see that there are two numeric variables, satisfaction_level and last_evaluation; both are scores from 0 to 1, with 0 the worst score and 1 the best.
After that we see two factor variables, sales and salary; these are categorical variables describing the department of the employee (sales) and the salary level (low, medium or high).
The other six variables are integers: the years spent in the company, the average monthly hours, the number of projects (from 2 to 7), whether the employee had a work accident (0 = NO, 1 = YES), whether the employee has left the company (0 = NO, 1 = YES) and whether the employee was promoted in the last 5 years (0 = NO, 1 = YES).
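
As a minimal sketch (not the code used for this report), the variable types described above can be counted programmatically. The data frame below, which I call hr_head for illustration, only rebuilds the first ten rows printed in the str() output; the assumption that factor code 8 of sales corresponds to the "sales" department is mine.

```r
# First ten rows, copied from the str() output above.
# Assumption: factor code 8 of `sales` is the "sales" department.
hr_head <- data.frame(
  satisfaction_level    = c(0.38, 0.80, 0.11, 0.72, 0.37, 0.41, 0.10, 0.92, 0.89, 0.42),
  last_evaluation       = c(0.53, 0.86, 0.88, 0.87, 0.52, 0.50, 0.77, 0.85, 1.00, 0.53),
  number_project        = c(2L, 5L, 7L, 5L, 2L, 2L, 6L, 5L, 5L, 2L),
  average_montly_hours  = c(157L, 262L, 272L, 223L, 159L, 153L, 247L, 259L, 224L, 142L),
  time_spend_company    = c(3L, 6L, 4L, 5L, 3L, 3L, 4L, 5L, 5L, 3L),
  Work_accident         = rep(0L, 10),
  left                  = rep(1L, 10),
  promotion_last_5years = rep(0L, 10),
  sales                 = factor(rep("sales", 10)),
  salary                = factor(c("low", "medium", "medium", rep("low", 7)),
                                 levels = c("high", "low", "medium"))
)

# Class of every column, then how many columns fall in each class.
col_types <- sapply(hr_head, function(x) class(x)[1])
print(table(col_types))
```

The count should agree with the description: two numeric scores, six integer variables and two factors.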


Now I am going to explore every variable with a plot and a summary, starting with satisfaction_level.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.0900  0.4400  0.6400  0.6128  0.8200  1.0000



In this graph we can notice that the density of the satisfaction level has a bimodal distribution, with two peaks, the first around 0.12 and the second around 0.76. In the Bivariate and Multivariate Plots sections I am going to analyze these two peaks in order to find the causes and the effects of this distribution. The summary alone, instead, could suggest a single-peaked, roughly symmetric distribution centred near the median value of 0.64.
The second variable that I want to analyze is the last evaluation.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.3600  0.5600  0.7200  0.7161  0.8700  1.0000



Like the satisfaction level graph, the last evaluation graph has a bimodal distribution, with one peak around 0.53 and another around 0.86. Before generating this graph, I expected last_evaluation to follow a normal distribution with a peak around 0.7 and two decreasing tails for the best and the worst employees. In this case the summary does not describe the plotted distribution correctly and may lead to an incorrect analysis.
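
To illustrate why a summary can hide two peaks, here is a small sketch on synthetic data (not the Kaggle data). The two groups are centred on the peak positions mentioned above; the spread of 0.04 and the 10% height threshold are assumptions chosen only for this example.

```r
# Synthetic scores: two equally sized groups centred at 0.53 and 0.86.
# qnorm(ppoints(n)) gives evenly spaced normal quantiles, so no random seed is needed.
scores <- c(qnorm(ppoints(500), mean = 0.53, sd = 0.04),
            qnorm(ppoints(500), mean = 0.86, sd = 0.04))

# The summary reports a centre that lies between the two groups.
print(summary(scores))

# Kernel density estimate, then local maxima of the estimated curve.
d <- density(scores)
is_peak <- which(diff(sign(diff(d$y))) == -2) + 1

# Keep only peaks at least 10% as high as the tallest one (assumed threshold).
major <- is_peak[d$y[is_peak] > 0.1 * max(d$y)]
peak_locations <- round(d$x[major], 2)
print(peak_locations)
```

The mean and the median both land in the valley between the two peaks, which is exactly the situation where reading only the summary is misleading.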

The third variable that I want to analyze is the number of projects. This time I prefer a histogram to a density chart.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   2.000   3.000   4.000   3.803   5.000   7.000



This graph has a distribution similar to what I expected before this visualization: there is a peak at 3 and 4 projects per employee, a long tail on the right (very few employees have more than 5 projects) and a short tail on the left. The distribution is positively skewed. For this variable, the visualization fits the summary well.

The fourth variable is the average monthly hours.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    96.0   156.0   200.0   201.1   245.0   310.0



Like the first two variables, this variable has a bimodal distribution with two peaks, one around 150 and one around 270. The monthly hours are probably connected with the number of projects; I will check whether there is a correlation in the Bivariate Plots section.

The next variable is the time spent in the company.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   2.000   3.000   3.000   3.498   4.000  10.000



This plot is quite interesting because it shows the distribution of the years spent in the company. The distribution could hypothetically tell me either that a lot of people leave after 3 years in the company, or that the company hired the majority of its employees between 2 and 4 years ago. I will analyze the first hypothesis in the Bivariate Plots section; the second hypothesis cannot be analyzed, because we do not have the year of hiring needed to better understand these results.

I will represent the next three variables (work accident, left and promotion in the last 5 years) in one grid.



In the first plot (Work accident), we can notice that the share of employees with an accident is above 14% of the total. In the Bivariate Plots section I am going to analyze which department has the highest number and whether accidents are correlated with leaving.
The second plot (Left) describes the rate of departures (about 1 employee in 4 leaves the company); with this EDA I want to better understand the causes of these departures.
Last but not least, the plot for the promotions. This plot shows that only a small percentage of the employees received a promotion in the last 5 years, and I want to explore whether this variable is correlated with leaving.
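
Because these three variables are coded as 0/1, the share of "YES" answers is simply the column mean. The sketch below shows this on a made-up data frame of 20 rows (called toy here); the values are invented for illustration and are not the figures of the real dataset.

```r
# Made-up 0/1 flags for 20 hypothetical employees (NOT the real data).
toy <- data.frame(
  Work_accident         = rep(c(1, 0), times = c(3, 17)),
  left                  = rep(c(1, 0), times = c(5, 15)),
  promotion_last_5years = rep(c(1, 0), times = c(1, 19))
)

# The mean of a 0/1 column is the proportion of ones; convert it to a percentage.
pct_yes <- round(colMeans(toy) * 100, 1)
print(pct_yes)
```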

The last two variables are categorical: sales and salary. Sales describes the department of the employee, and salary the salary level.



In these two graphs we simply notice that some departments have a large number of employees, such as sales, support and technical, and that the salaries in this company are mainly low and medium.

Univariate Analysis Section

With this dataset, my goal is to understand why employees leave and what the characteristics of the employees who leave are. I think the features that will help me reach this goal are the satisfaction level, the last evaluation and the ratio between the average monthly hours and the number of projects. So, before investigating further, I create a new variable: the ratio between average_montly_hours and number_project.

# Monthly hours per project (column names are spelled as in the dataset).
hr$ratio_hours_projects <- round(hr$average_montly_hours / hr$number_project, 2)
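
As a quick check of the new variable, the sketch below applies the same formula to the first five rows shown in the str() output of the dataset, using plain vectors instead of the hr data frame.

```r
# First five values of each column, copied from the str() output of the dataset.
average_montly_hours <- c(157L, 262L, 272L, 223L, 159L)
number_project       <- c(2L, 5L, 7L, 5L, 2L)

# Same formula as hr$ratio_hours_projects, applied to plain vectors.
ratio_hours_projects <- round(average_montly_hours / number_project, 2)
print(ratio_hours_projects)
# [1] 78.50 52.40 38.86 44.60 79.50
```

An employee with 7 projects and 272 hours has fewer hours per project than one with 2 projects and 157 hours, so the ratio captures something that neither variable shows alone.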



Bivariate Plots Section

Before starting the Bivariate Plots section, I want to investigate which variables have the strongest and the weakest correlations. So I put the dataset in a heatmap.



With this heatmap, we can quickly notice the variables with the most considerable correlations. I start by analyzing the first two correlated variables: the last evaluation and the number of projects.
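
The numbers behind a correlation heatmap can also be listed as ranked pairs. This is only a sketch: it uses the first ten rows printed in the str() output, so the coefficients it prints are not those of the full dataset, and the names num_vars and ranked are mine.

```r
# Five numeric columns, first ten rows copied from the str() output.
# (The 0/1 flags are constant in these rows, so they are left out.)
num_vars <- data.frame(
  satisfaction_level   = c(0.38, 0.80, 0.11, 0.72, 0.37, 0.41, 0.10, 0.92, 0.89, 0.42),
  last_evaluation      = c(0.53, 0.86, 0.88, 0.87, 0.52, 0.50, 0.77, 0.85, 1.00, 0.53),
  number_project       = c(2, 5, 7, 5, 2, 2, 6, 5, 5, 2),
  average_montly_hours = c(157, 262, 272, 223, 159, 153, 247, 259, 224, 142),
  time_spend_company   = c(3, 6, 4, 5, 3, 3, 4, 5, 5, 3)
)

# Pearson correlation matrix: the values a heatmap would colour.
cm <- cor(num_vars)

# Keep each pair once (upper triangle) and sort by absolute correlation.
idx <- which(upper.tri(cm), arr.ind = TRUE)
ranked <- data.frame(
  var1 = rownames(cm)[idx[, "row"]],
  var2 = colnames(cm)[idx[, "col"]],
  r    = cm[idx]
)
ranked <- ranked[order(-abs(ranked$r)), ]
print(ranked, row.names = FALSE)
```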



In this plot we can notice that the median value of the last evaluation increases as the number of projects increases. The evaluators probably give a better score to the hard workers who are on a higher number of projects. In red I represent the mean value of the last evaluation.
The second plot that I want to create is the last evaluation against the average monthly hours.



In this plot we can notice a large amount of data in the bottom left corner, representing low evaluation scores paired with a low number of hours, and a large amount of data in the top right corner, representing high evaluation scores paired with a high number of hours. The other data are very scattered, so with this type of plot we cannot notice a pattern for the values between these two clusters. In red I represent the mean value of the last evaluation.
The third plot that I want to analyze is the relationship between the number of projects and the average monthly hours.



As I expected, the number of hours per employee is closely related to the number of projects, especially for the employees with 2, 6 and 7 projects. For employees with more than 4 projects, the mean of the average monthly hours is above the mean monthly hours of the whole dataset. In the Multivariate Plots section I will analyze these two variables together with a third variable: left.
Because my purpose is to understand why employees leave, I am now going to plot the variable left against the variables with which it had the strongest correlation.



The first comparison is the satisfaction level against whether an employee left (0 = No, 1 = Yes). This scatter plot is more readable than a box plot, because a box plot summarizes the whole of the data and does not show whether there are any patterns. In the scatter plot, instead, we can notice that the people who stay in the company have a medium or high satisfaction level, while the people who leave can be divided into 3 classes: the first class is composed of the employees with a low satisfaction level, who probably change company in order to find one that stimulates them more; the second class is composed of the people with a satisfaction level under 0.5; and the third class is composed of the employees with a high satisfaction level (these might be the employees who left because they wanted to pursue career growth in another company).
The next variables that I want to compare are left and work accident.



In this plot we can notice that people do not leave the company because they had an accident; there are few points in the top right box, which tells us that this variable is not the way to better understand the causes of the departures.
So the next two variables are left and the time spent in the company.



In this plot we can notice that leaving is not strictly correlated with the years spent in the company. The highest percentages of employees who leave are found among the people who have worked for this company for 4, 5 and 6 years.
Two other similar plots that I want to analyze are the left/salary plot and the left/sales plot.



In these two plots I do not want to see how many employees left in each salary group or department, but the percentage of employees who left. In the salary plot, the employees who leave most frequently are those with a low salary, followed by those with a medium salary and then those with a high salary. In the Multivariate Plots section I will analyze this ratio with another variable: the satisfaction level, which is highly correlated with leaving.
In the second plot, sales vs left, we can see that the percentage of employees who leave the company is about the same for most of the departments, with the exception of management and RandD.
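
The difference between counting leavers and computing the percentage of leavers within each group can be sketched with prop.table(). The data frame below is made up (10 hypothetical employees per salary level) and only mimics the ordering described above; it does not reproduce the real percentages.

```r
# Made-up data: 10 hypothetical employees per salary level (NOT the real data).
toy <- data.frame(
  salary = factor(rep(c("low", "medium", "high"), each = 10),
                  levels = c("low", "medium", "high")),
  left   = c(rep(c(1, 0), times = c(3, 7)),   # low:    3 of 10 left
             rep(c(1, 0), times = c(2, 8)),   # medium: 2 of 10 left
             rep(c(1, 0), times = c(1, 9)))   # high:   1 of 10 left
)

# Counts of stayers (0) and leavers (1) per salary level.
counts <- table(toy$salary, toy$left)

# margin = 1 normalises each row, so each salary level sums to 100%.
leave_rate <- round(prop.table(counts, margin = 1)[, "1"] * 100, 1)
print(leave_rate)
```

Normalising within each group matters because groups of very different sizes, such as departments, cannot be compared fairly on raw counts.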

Bivariate Analysis Section

In this section I mainly analyze the left variable compared with the other variables. As we can see, the left variable is strongly correlated with the satisfaction level: the lower the level of satisfaction, the higher the number of people who left the company.
The last evaluation, instead, is correlated with the number of projects (if the number of projects is higher, then the last evaluation score is higher) and with the average monthly hours (the hard workers are rewarded with a high last evaluation score).
Before this analysis I thought that the employees who left the company were those with a work accident in their career, but I was totally wrong.
The strongest relationship that I found is the one between the satisfaction level and left.

Multivariate Plots Section

In this section I want to create and analyze four plots that I think describe this dataset well.
The first that I want to create is the last evaluation by number of projects, split by the left variable.



Wow, with these two plots I better understand how leaving relates to the last evaluation and the number of projects. The employees with a high number of projects and a high evaluation score tend to leave the company, probably to advance in their career, especially the employees with 7 projects on their hands (they all leave the company). At the opposite end, the employees with a low evaluation score and a low number of projects also tend to leave the company, probably for a new job that makes better use of their skills. The employees who stay in the company, instead, have much the same evaluation level even when they have different numbers of projects on their hands.
Another plot that I want to visualize is the average monthly hours by number of projects, split by the left variable.



With this plot we can notice that hard workers, with a high number of average monthly hours, tend to leave the company, and the same holds for the people who are less engaged. The employees who stay in the company, instead, all have around the same monthly hours.
Another plot that I analyze in this section is the satisfaction level against the last evaluation, split by the left variable.



In this plot we can notice that there are three types of people who leave the company: the first are the employees with a high evaluation score and a low satisfaction level, the second are the employees with a low evaluation score and a low satisfaction level, and the third are the employees with a high satisfaction level and a high evaluation score.
But this plot does not explain why they leave, so I try to understand whether there is a main cause that pushes the employees to leave. The two variables that I want to add are salary and the number of projects.
First I add the number of projects to the previous plot.



In this plot we can see a larger picture of the situation. The people who leave with the lowest number of projects, two, have a low satisfaction level and a low evaluation level. Among the employees who follow three projects there are only a few people who left, without a specific pattern. For the employees with four and five projects, the people who left the company have a high satisfaction level and a high evaluation level. Last but not least, the employees with six or seven projects who left have a low satisfaction level but a high last evaluation level.
The number of projects could be an interesting variable to better understand future departures. Another cause that I want to try to understand is how much salary changes the satisfaction level and, consequently, the decision to leave the company.



In this case the plot does not explain anything, so salary is not a great choice for understanding why people leave.

Final plot and analytics

In this section I choose three of the previous plots to better explain how this EDA was useful for me.

Plot One

The first plot that I want to explain is the left variable on average monthly hours by number of projects.



In this plot we can see that the people who stay in the company have similar average monthly hours regardless of the number of projects. These people are probably doing the job well, but not perfectly. The people who left the company, instead, are hard workers for the same number of projects and spend more time on the projects, probably in order to do better. There is only one exception: the people who left the company with two projects are not good workers, and their monthly hours are well below the average monthly hours. From this plot we can suppose that people with a strong work ethic will probably leave their job in the future in order to seek an advance in their careers.
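
The comparison in this plot comes down to a mean of monthly hours for each combination of number of projects and left. The sketch below computes that table with tapply() on made-up values (the toy data frame is hypothetical and only imitates the pattern described above, it is not taken from the dataset).

```r
# Made-up data imitating the described pattern (NOT the real data).
toy <- data.frame(
  number_project       = c(2, 2, 2, 2, 4, 4, 4, 4, 6, 6, 6, 6),
  left                 = c(0, 0, 1, 1, 0, 0, 1, 1, 0, 0, 1, 1),
  average_montly_hours = c(195, 205, 140, 150, 198, 202, 240, 250, 200, 210, 270, 280)
)

# Mean monthly hours for every (projects, left) combination.
mean_hours <- tapply(toy$average_montly_hours,
                     list(projects = toy$number_project, left = toy$left),
                     mean)
print(mean_hours)

# Leavers minus stayers: negative means the leavers worked fewer hours.
gap <- mean_hours[, "1"] - mean_hours[, "0"]
print(gap)
```

In this toy table the gap is negative only for two projects, which is the shape of the exception discussed above.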

Plot Two

The second plot that I want to explain is the left variable on the years spent in the company.



With this second plot the first thing that we can notice is the roughly bell-shaped distribution of the percentage of leavers over the years spent in the company. There is a peak in the percentage for the employees who have been in the company for five years.
So there is a high percentage of leavers (above 50%) among the employees who have been in the company for five years; more generally, employees are most likely to leave when they have been in the company for 3 to 6 years. From 7 to 10 years, no one leaves the company, probably because of attachment and a lack of ambition.

Plot Three

The third plot that I want to explain is the left variable in the comparison of satisfaction level and last evaluation level for every number of projects.



In these plots we can notice that the employees who leave with two projects on their hands have a low satisfaction level and a low last evaluation level, probably because they are not good workers and are less engaged than the other employees. There is, instead, a very low number of employees who leave the company with three projects on their hands. The workers who leave the company with four or five projects have a high satisfaction level and a high evaluation level; they are probably changing job to advance in their career. Finally, the workers who leave with a high number of projects (6 or 7) have a high evaluation level, because they are good workers, but a very low satisfaction level, probably because they are overwhelmed by the job. The workers who have 7 projects on their hands all leave the company.
These plots help me to better understand the causes of a low satisfaction level and, consequently, of leaving the company.
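
The three groups of leavers can be made explicit with a simple rule. In the sketch below, the ten rows are the first ten employees of the str() output (all with left = 1); the cut-offs 0.6 and 0.7 and the profile labels are hypothetical choices of mine for illustration, not values derived in this analysis.

```r
# First ten rows of the dataset (all leavers), copied from the str() output.
leavers <- data.frame(
  satisfaction_level = c(0.38, 0.80, 0.11, 0.72, 0.37, 0.41, 0.10, 0.92, 0.89, 0.42),
  last_evaluation    = c(0.53, 0.86, 0.88, 0.87, 0.52, 0.50, 0.77, 0.85, 1.00, 0.53),
  number_project     = c(2L, 5L, 7L, 5L, 2L, 2L, 6L, 5L, 5L, 2L)
)

# Hypothetical cut-offs, chosen only for this illustration.
sat_cut  <- 0.6
eval_cut <- 0.7

high_sat  <- leavers$satisfaction_level >= sat_cut
high_eval <- leavers$last_evaluation >= eval_cut

# Assign one of the three profiles; anything else is labelled "other".
leavers$profile <- ifelse(high_sat & high_eval, "satisfied high performer",
                   ifelse(!high_sat & high_eval, "unsatisfied high performer",
                   ifelse(!high_sat & !high_eval, "unsatisfied low performer",
                          "other")))

# Cross-tabulate the profiles against the number of projects.
profile_by_projects <- table(leavers$number_project, leavers$profile)
print(profile_by_projects)
```

Even in these ten rows, each profile lines up with a different number of projects: two projects for the low performers, five for the satisfied high performers, six and seven for the unsatisfied high performers.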

Reflection

In conclusion, the struggles that I went through in this EDA were understanding which variables best explain why the employees leave the company, choosing good plots and choosing the right comparisons between variables.
What went well in this EDA was finding correlations, because most of the variables are correlated and easy to visualize.
It is surprising that salary is not a cause of leaving the company; what matters is how engaged an employee is in their work.
In the future, a machine learning predictive analysis could be done on this dataset to better understand how much an employee is at risk of leaving the company or, only if a new variable for the skill level of the employee were created, a model could be built to better engage employees with the right number of projects.

Resources

N/A